(54) Method and system for document storage management based on document content



(57) A document storage management system and method that manages the storage of documents based
upon the similarity of the content of the documents. Groups of documents are created based upon the
similarity of the contents of the documents. Those groups are displayed to the user in a ranked list of
selectable groups to permit selection of a group or document. The storage of the selected group or
document is then managed by, for example, deleting, compressing, or copying. The displayed list may be
ranked based upon a least recently used policy, the relevance to a predetermined topic, the size of the
group, the radius of the group based upon the maximum distance of any document from the group
centroid, the number of documents in the group, and any combination of these parameters.
Description

[0001] This invention is directed to a method and a system for managing the storage of documents. More particularly, 
this invention is directed to a method and a system for managing documents stored in a limited capacity storage device
based on the content of the stored documents.

[0002] Information access plays a key role across an ever expanding range of leisure and work activities. Laptop, 
hand-held and palm-top computers are being reinvented as "portable information appliances" with the promise of in- 
formation access anytime and anywhere. However, users of these devices have become reliant on continuous, high 
speed, low-cost networks available to computers, and portable computers are rarely connected to such networks. One
technique that makes portable computers less dependent upon networking is a cache.

[0003] A cache is generally defined as a fast access memory that stores a copy of frequently referenced information. 
A cache reduces the reliance of the computer on a connection to a network and can provide documents even when 
disconnected from a network. Cache management systems control the information that is stored in a cache. 
[0004] An important part of a cache management system is the replacement policy. The replacement policy governs
which items will be removed from the cache when the cache fills up, that is, when there is insufficient space in the cache to
store a new item. A replacement policy requires inherently difficult decisions because each decision involves predicting
the future. The accuracy of those predictions can only be measured after several requests for information have been
provided to the cache. The accuracy of the predictions is measured by the efficiency of the replacement policy. The
efficiency of a replacement policy for a cache is determined by the ratio of hits (items found in the cache) to misses
(items missing from the cache).

[0005] The efficiency of a cache management system takes on greater importance for portable information appli- 
ances. When a computer is connected over a wireless network, and a cache miss occurs, it may not only take a 
relatively long time to locate and download the missing document, but the miss may also result in expensive fees for
the network connection usage for the download. If the cache had the requested document in local storage, there would
be no need to download the document from a network and expensive access charges would be avoided. Moreover,
when a system is disconnected from a network, a cache miss may stop the user from continuing work. In an attempt 
to solve these and other problems with cache management systems, researchers have been investigating new re- 
placement policies that better predict requests for documents or files. 

[0006] One attempt to solve these problems involves augmenting a computational replacement policy with direct
user interaction. This is based on the assumption that people often know which is the most important information to
keep in local storage. Two systems, TeleWeb and Mowgli, are mobile web browsers that allow users to lock documents
into the local storage. TeleWeb and Mowgli are described in "TeleWeb: Loosely Connected Access to the World-Wide
Web", W.N. Schilit et al., Computer Networks and ISDN Systems 28, pp. 1431-1444, 1996, and "Optimizing World-Wide
Web for Weakly Connected Mobile Workstations: An Indirect Approach", T. Alanko et al., Proceedings of the 2nd
International Workshop on Services in Distributed and Networked Environments (SDNE'95), June 5-6, 1995,
respectively, and are incorporated herein by reference in their entireties. In TeleWeb, the storage contents are exposed to
the user in a file by file listing and users may pin or lock items, or delete items. One problem with this type of user
control is that the users tend to lock more than they delete. Therefore, the local storage becomes much less effective
because it becomes filled with locked, but no longer relevant, documents.

[0007] Another approach taken by previous systems is to ask the user which documents are appropriate to discard
and which documents are appropriate to keep. Such an approach is described in "How to Program Networked Portable
Computers", D. Goldberg et al., Proceedings of the Fourth Workshop on Workstation Operating Systems, pp. 30-33,
October 1993, incorporated herein by reference in its entirety. When replacement is necessary, the system may, for
example, use a pop-up dialog box to ask for suggestions on which file to remove from storage. A pop-up dialog box is
shown in Fig. 1. Fig. 1 shows a graphical user interface (GUI) 10 displaying a list of documents 12. The GUI 10 allows
the user to select a file to discard based on the title 14.

[0008] The list may be sorted for the user according to the titles 14 of the documents, the sizes of the documents
or the dates of last access. The appropriate sorting is invoked when the user clicks the heading for the appropriate
column. One problem with this approach is that the user is forced to explicitly select documents. A user must explicitly
designate those documents that the user wishes to manage. For example, a user is forced to formulate a search query
to retrieve the appropriate documents. Therefore, freeing up storage space of any significant amount requires multiple
interactions. Furthermore, it is often difficult to determine a file's importance from only its file name, size or date of last
access.

[0009] Some storage management systems try to predict file accesses by analyzing the history of file accesses. Two
systems have been implemented that allow users to turn on recording of file accesses. The systems are described in
"Detection and Exploitation of File Working Sets", C. Tait et al., Proceedings of the 11th International Conference on
Distributed Computing Systems, May 1991, pp. 2-9, and "Disconnected Operation in a Distributed File System", James
Kistler, (1993) Ph.D. Thesis, School of Computer Science on file with the Carnegie Mellon University Library,
incorporated herein by reference in their entireties. Traces of file accesses are used by these systems to hoard or prefetch files
into the local storage. More recently, a system described in "Intelligent File Hoarding for Mobile Computers", C. Tait et
al., MOBICOM 95, pp. 119-125, ACM, Inc., incorporated herein by reference in its entirety, extended this concept to a
graphic interface that permits users to select which of a number of profiles to hoard. These profiles are created by
observing patterns of file accesses to both application files and data files. However, this style of storage management
only works well when users have recurring and predictable patterns of file accesses.

[0010] A number of systems have used automatic techniques to present groups of related documents to a user. One
technique characterizes clusters of documents with keywords and titles of representative documents in order to support
browsing or full-text retrieval. This technique is described in "Scatter/Gather: A Cluster-Based Approach to Browsing
Large Document Collections", D.R. Cutting et al., Proceedings of the 15th Annual International ACM/SIGIR Conference,
1992, ACM, Inc., and "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results", M.A. Hearst et al.,
Proceedings of the 19th Annual International ACM/SIGIR Conference, Zurich 1996, incorporated herein by reference
in their entireties.

[0011] Another system presents clusters of Web documents with keyword pairs based on their intra-cluster similarity,
and allows the user to expand the contents of clusters. Such a system is described in "Automatically Organizing
Bookmarks per Contents", Y.S. Maareck et al., Computer Networks and ISDN Systems 28, pp. 1321-1333, 1996, incorporated
herein by reference in its entirety. However, these systems do not provide facilities for deleting, compressing, or taking
other management actions on groups of documents.

[0012] None of the above systems works well for personal information appliances that provide access to information
that cannot be neatly organized and that do not have a recurring access pattern.

[0013] A document management system is needed that augments the advantages of direct user interaction with a 
minimally invasive technique, does not fill up the cache with locked documents, and does not rely upon recurring access 
patterns. 

[0014] This invention provides a method and a system that allow users to manage storage space by choosing among
groups of documents having similar content. The content of the documents may be presented to the user in groups
identified by topics. Grouping documents allows the user to free up space quickly by avoiding expensive file-by-file
decisions. Grouping documents also permits the user to determine which files need to be stored locally, either for speed
of access or to avoid interruptions when subsequently operating in a disconnected state, based upon the topics for
which the user anticipates a future need. For example, Fig. 2 shows a GUI 20 according to one embodiment of this
invention. The GUI 20 presents three groups from which the user may choose. Each group has a topic that may
characterize the documents collected into that group. For example, as shown in Fig. 2, documents are collected into
three groups, a first group 22 entitled "95 Windows Microsoft NT network workstation", a second group 24 entitled
"mobile wireless network computing workstation system" and a third group 26 entitled "developer HTML web CGI
Microsoft Program URL". Thus, the first group 22 includes documents related to networking with Windows 95 and NT,
the second group 24 includes documents related to issues in mobile computing and wireless networks, and the third
group 26 includes documents related to developing web applications.

[0015] The user can determine which files (i.e. documents) to remove, prefetch or hoard based upon these groups 
of documents that have similar content. The user can quickly select a group of documents having similar, but no longer 
relevant content, delete that group, and free up a large amount of space by removing the entire group. The user may 
also anticipate which groups of documents will be used in the future and prefetch or hoard these documents from an
external source into the local storage based upon the content of those documents.

[0016] The user may rely solely upon the determination of the similarity of the content of the documents within the 
groups or may additionally rely upon the determination of the date and time of last access, the relevance of the groups 
to a predetermined topic, the total size of the group, the radius of the group, and/or the number of documents in the
groups. Additionally, the groups may be ranked by the date and time of last access to the documents in the group. The
groups may be, alternatively, ranked in accordance with any number or combination of group characteristics. 
[0017] The grouping of documents can also be used as a guideline for making document storage decisions. For
example, a group may be selected out of a listing of groups and the selected group may be expanded into a list of
documents from the selected group. Preferably, the document list is ranked in accordance with predetermined or
user-defined attributes. The user is then free to select one, several or all of the documents within the group for a storage
management process.

[0018] These and other features and advantages of this invention are described in or are apparent from the following
detailed description of the preferred embodiments. 

[0019] The preferred embodiments of this invention will be described in detail, with reference to the following figures,
wherein:

Fig. 1 is a graphical user interface of a conventional document storage management system;

Fig. 2 is a graphical user interface showing documents grouped by the similarity of content of the documents in
accordance with this invention;

Fig. 3 is a block diagram of an information system using the document storage management system of one
embodiment of this invention;

Fig. 4 is a flow chart outlining the operation of one embodiment of the document storage management system of
this invention;

Fig. 5 is a flow chart outlining the document similarity grouping routine of the invention; and

Fig. 6 is a block diagram of one embodiment of a processor of this invention.

[0020] Fig. 3 shows a block diagram of an information system 30 operating with the document management system
in accordance with this invention. A cache 32 may be a portion of a local storage device such as a hard drive 34 or a
memory device 36. In any case, the cache 32 is a local storage buffer that stores documents that are anticipated to 
be accessed by the information system 30. 

[0021] It should be appreciated that the document management system of this invention may extend to many types 
of information storage systems regardless of whether they are portable or non-portable. It should also be appreciated
that although this description generally refers only to "documents", the term "document" also encompasses any
file or object that may be grouped with other files based upon the similarity of the content of those files or objects. A
"file" is intended to include any unit of information. "Content" is intended to include any divisible or identifiable portion
of a document such as abstracts, key words, titles or descriptions. 

[0022] It is to be understood that the term "document" is intended to include text, video, audio and any other medium
and any combination of media. Further, it is to be understood that the term "text" is intended to include text, digital ink
in stroke or bitmap format, audio, images, video or any other structure or content of a document.
[0023] The information system 30 includes a processor 38 running software that embodies the document
management system and which controls the document management system. A communication interface 40 allows the
information system 30 to communicate with an external data source 42, such as another computer, a network, a file server,
or a removable medium such as a portable hard disk or CDROM. The communication channel 44 established between
the communication interface 40 and the external data source 42 may be interrupted or disconnected for a variety of
reasons. The communication channel 44 may be expensive to maintain or the information system 30 may be a portable
information system that is only intermittently connected to the external data source 42.

[0024] The processor 38 communicates with an input/output interface 46 that communicates with any number of
conventional input/output devices, such as a mouse 48, a keyboard 50, a display 52, and/or a pen 54. While Fig. 3
only shows a display 52, it is to be understood that any type of presentation device that is appropriate for the type of
document is intended.

[0025] The document management system manages the documents in the local storage device 34 by selectively 
removing documents from the local storage device 34 to create additional space in the local storage device 34. The 
document management system may also communicate through the communication channel 44 to the external data
source 42 to download documents and store documents in the local storage device 34 in anticipation of future require- 
ments. 

[0026] As shown in Fig. 3, the system 30 is preferably implemented using a programmed general purpose computer.
However, the system 30 can also be implemented using a special purpose computer, a programmed microprocessor
or microcontroller and any necessary peripheral integrated circuit elements, an ASIC or other integrated circuit, a
hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD,
PLA, FPGA or PAL, or the like. In general, any device capable of implementing a finite state machine that is in turn
capable of implementing the flowcharts shown in Figs. 4 and 5 can be used to implement the system 30.

[0027] Additionally, as shown in Fig. 3, the memory 36 is preferably implemented using static or dynamic RAM.
However, the local storage device 34 can also be implemented using a floppy disk and disk drive, a writable optical
disk and disk drive, flash memory or the like. Additionally, it should be appreciated that the local storage device 34 can
be either distinct portions of a single memory or physically distinct memories. This is also the case with the memory 36.
[0028] Furthermore, it should be appreciated that the links 44, 37 and 39 connecting the external data source 42 to 
the communication interface 40 and the processor 38 to the memory 36 and the hard drive 34, respectively, can be
wired or wireless links to a network (not shown). The network can be a local area network, a wide area network, an
intranet, the Internet or any other distributed processing and storage network. In this case, the electronic data is pulled 
from a physically remote external data source 42, the memory 36 or hard drive 34 through the links 44, 37 or 39 for 
processing in the processor 38 according to the method outlined below. Some portions of or the entire electronic 
document 22 can be stored locally in a portion of the memory 36, hard drive 34 or some other memory (not shown) of
the system 30.

[0029] Fig. 4 is a flow chart of the operation of one embodiment of the control routine of the document management
system of this invention. The control routine starts at step S100, and continues to step S110 where it receives a request
to group the documents. The documents being grouped may be located in the cache 32 or in the external data source
42. The local documents are grouped when more storage space is needed in the cache 32 and the documents in the
external data source 42 are grouped when the information system 30 is loading more documents into the cache 32.
Grouping the documents relies upon procedures that are well known in the art. These procedures group documents
based upon the similarity of the content of the documents. An example of such a procedure is disclosed in "Automatic
Text Processing", by Gerard Salton, pp. 303-309, Addison-Wesley Publishing Company, Inc. (1989), which is
incorporated herein by reference in its entirety.

[0030] The specific method chosen for grouping the documents is not crucial, because grouping can take place in
the background. Therefore, the speed of the grouping method may not be important. The only requirement is that the
system groups the documents based upon the similarity of the content in the documents. It is to be understood that
the term "similarity" is intended to include any measure of a document or a portion of a document's relatedness or
relevance to another document or portion of a document. The similarity may be calculated, predetermined or determined
by a user. The documents may be grouped by any conventional clustering algorithm. Examples of clustering algorithms
include algorithms that cluster based on the size, radius of the document, the time of a document's creation, last edit
or last access or any other attribute of a document. Overlapping of groups may be allowed or prohibited. A specific
example of a grouping method will be described in detail below.

[0031] The control routine displays an ordered or ranked group listing at step S120. An example of a ranked group 
list is shown in Fig. 2. The control routine then determines, at step S130, whether a group has been selected by the 
user. If a group has been selected, the control routine continues to step S140. Otherwise, the control routine jumps to 
step S190. 

[0032] In step S140, the control system determines whether the user has requested that the selected group be
expanded. If the control routine determines that the user has requested that the selected group be expanded, then the
control routine continues to step S150. Otherwise, the control routine jumps to step S170. In step S150, the document
management system displays a ranked list of the documents within the selected group. Fig. 2 shows an expanded
listing 28 of the selected group 22.

[0033] If, at step S140, the control system determines that an expand group command has been received, then the
control routine proceeds to step S150. At step S150 the control routine displays a ranked list of documents within the
selected group. Each document within the group is selectable by a user. If the control system determines at step S160
that a document has been selected, then the control routine proceeds to step S170.

[0034] At step S170, the system determines whether a user has input a command to operate on the selected
document or group of documents. If at step S170, the system receives a command, then the control routine continues to
step S180. The command may include but is not limited to a command to delete the selected document or group of
documents from the cache or to store the selected document or group of documents from the external data source
into the cache. In step S180, the control routine executes the command on the selected document or group of
documents. The control routine then proceeds to step S190.
[0035] Alternatively, if at step S160, no document is selected, then the control routine continues to step S190.
Similarly, if at step S170, the control system does not receive a command, then the control routine also jumps to step S190.
In step S190, the control routine stops.

[0036] In step S170, the user of the method and the system of this invention may choose from any number of storage
management commands. Examples of storage management commands include deletion, compression, and copying
of the selected document or group of documents.
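
By way of illustration only, the branching of the control routine of Fig. 4 might be sketched in Python as follows. The function name `control_routine` and its boolean parameters are hypothetical and stand in for the user interactions described above; the sketch merely records which steps are visited and does not display or modify any documents.

```python
def control_routine(group_selected, expand_requested=False,
                    document_selected=False, command=None):
    """Return the sequence of Fig. 4 steps visited for one pass (hypothetical sketch)."""
    # S100 start, S110 receive grouping request, S120 display ranked groups,
    # S130 test whether a group was selected.
    steps = ["S100", "S110", "S120", "S130"]
    if group_selected:
        steps.append("S140")            # expand requested?
        proceed = True
        if expand_requested:
            steps += ["S150", "S160"]   # show ranked documents, then test selection
            proceed = document_selected
        if proceed:
            steps.append("S170")        # command received?
            if command is not None:
                steps.append("S180")    # execute, e.g. delete, compress or copy
    steps.append("S190")                # stop
    return steps


if __name__ == "__main__":
    print(control_routine(True, command="delete"))
```

Under these assumptions, selecting a group and issuing a command without expanding the group passes directly from step S140 to step S170.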

[0037] One method for determining the similarity of documents comprises the method outlined in the flow chart of
Fig. 5, which begins in step S300. Next, in step S310, individual words occurring in the documents of a collection are
identified. Then, in step S320, a stop list of common function words ("and," "of," "or," "but," "the," and the like) is used
to delete high-frequency function words that are insufficiently specific to represent the content of the documents. Next,
in step S330, an automatic suffix-stripping routine is used to reduce each remaining word to word-stem form. This
routine reduces all words exhibiting the same stem to a common form. Control then continues to step S340.
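
By way of illustration only, steps S310 to S330 might be sketched in Python as follows. The names `STOP_WORDS`, `SUFFIXES`, `strip_suffix` and `word_stems` are hypothetical. The stop list contains only the five words named above, and the suffix list is a toy assumption rather than the suffix-stripping routine of any particular embodiment.

```python
import re

# Hypothetical stop list: only the function words named in the description.
STOP_WORDS = {"and", "of", "or", "but", "the"}

# Toy suffix list (an assumption), tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_stems(text):
    """Return the word stems of a document as a list."""
    words = re.findall(r"[a-z]+", text.lower())           # S310: identify words
    content = [w for w in words if w not in STOP_WORDS]   # S320: apply the stop list
    return [strip_suffix(w) for w in content]             # S330: strip suffixes


if __name__ == "__main__":
    print(word_stems("The caching of cached documents"))
```

With this toy suffix list, "caching" and "cached" reduce to the same stem, which is the property the suffix-stripping step relies upon.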
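[0038] Next, in step S340, a typical document similarity coefficient is obtained. A description of a document
similarity coefficient calculation of a method and system of one embodiment of this invention follows. For each
remaining word stem $T_j$ and document $D_i$, a weighting factor $W_{ij}$ is determined. This weighting factor $W_{ij}$ includes in part the
term frequency and in part the inverse document-frequency for the term. For example, the weighting factor is
determined as:

$$W_{ij} = tf_{ij} \times \log\left(\frac{N}{n_j}\right), \qquad (1)$$

where, in the usual formulation, $tf_{ij}$ is the frequency of the word stem $T_j$ in the document $D_i$, $N$ is the number of
documents in the collection and $n_j$ is the number of documents in which $T_j$ occurs.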
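
By way of illustration only, a weighting of this kind might be computed in Python as follows, assuming the standard form in which the term frequency is multiplied by log(N / n_j), where N is the number of documents and n_j is the number of documents containing the stem. The function name `tfidf_weights` and the sparse dictionary representation of a document vector are assumptions made for the sketch.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of word-stem lists. Returns one {stem: weight} dict per document."""
    n_docs = len(docs)
    # n_j: number of documents in which each stem occurs at least once.
    doc_freq = Counter(stem for doc in docs for stem in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)  # tf_ij for this document
        weights.append({stem: term_freq[stem] * math.log(n_docs / doc_freq[stem])
                        for stem in term_freq})
    return weights


if __name__ == "__main__":
    vectors = tfidf_weights([["cache", "cache", "network"], ["network", "browser"]])
    print(vectors)
```

A stem that occurs in every document receives a weight of zero under this form, so it contributes nothing to the similarity between documents.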

[0039] Then, document vectors are calculated. A document vector for each document $D_i$ is represented by the set
of word stems together with the corresponding weighting factors:

$$D_i = (T_1, W_{i1};\; T_2, W_{i2};\; \ldots;\; T_t, W_{it}), \qquad (2)$$

[0040] or

$$D_i = (D_{i1}, D_{i2}, \ldots, D_{it}), \qquad (3)$$

where $D_{ij}$ is a combination of the word stem $T_j$ and the inverse document frequency weight $W_{ij}$.

[0041] Finally, the similarity coefficient is calculated between documents using a conventional inner-product formula:

$$\mathrm{Sim}(D_1, D_2) = \sum_{j=1}^{t} D_{1j} D_{2j}. \qquad (4)$$

[0042] Eq. (4) can be used with variable-length vectors that exhibit a variable number of included terms. However, 
when non-normalized document vectors are used, longer documents with more terms have a greater chance of match- 
ing than do the shorter document vectors. This, however, does not necessarily produce the best results. The document 
similarity factor of Eq. (4) is advantageously normalized: 

$$\mathrm{Sim}(D_1, D_2) = \frac{\sum_{j=1}^{t} D_{1j} D_{2j}}{\sqrt{\sum_{j=1}^{t} (D_{1j})^2 \, \sum_{j=1}^{t} (D_{2j})^2}}. \qquad (5)$$



[0043] Eq. (5) represents the cosine of the angle between the document vectors considered as vectors in a space
of t dimensions, where t is the number of distinct terms in the system.
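
By way of illustration only, the inner-product similarity and its normalized (cosine) form might be sketched in Python as follows. The function names and the sparse dictionary representation, in which absent terms have an implied weight of zero, are assumptions made for the sketch.

```python
import math


def inner_product(d1, d2):
    """Inner product of two sparse vectors; terms missing from either contribute zero."""
    return sum(weight * d2[term] for term, weight in d1.items() if term in d2)


def cosine_similarity(d1, d2):
    """Inner product divided by the product of the vector lengths."""
    norm = math.sqrt(inner_product(d1, d1) * inner_product(d2, d2))
    return inner_product(d1, d2) / norm if norm else 0.0


if __name__ == "__main__":
    short_doc = {"cache": 1.0, "network": 1.0}
    long_doc = {"cache": 2.0, "network": 2.0}
    print(inner_product(short_doc, long_doc), cosine_similarity(short_doc, long_doc))
```

In this example the longer document raises the non-normalized inner product to 4.0, while the normalized similarity is 1.0 because the two vectors point in the same direction.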

[0044] After the similarity coefficient is determined between document vectors in step S340, control continues to step
S350. In step S350, documents are ranked in decreasing order of similarity. Next, in step S360, groups are formed
based on a similarity metric. For example, the documents may be grouped using an algorithm that starts with each
document in its own group and iteratively merges groups. A group is merged with another group only if the merger
results in a group with the smallest possible radius. That is, for example, if only the first and the second groups remain
as potential mergers, then the first group is examined to determine the radius of a merged group that includes the first
group, and the second group is examined to determine the radius of a merged group that includes the second group. If
the radius of the merged group that includes the first group is smaller than the radius of the merged group that includes
the second group, then the first group is merged. If there are more than two potential mergers, then all potential mergers
are examined and only the merger that results in the smallest radius is performed. This process is repeated until no
additional merging is possible without exceeding a radius limit.
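
By way of illustration only, a merging procedure of this kind might be sketched in Python as follows. The sketch assumes dense document vectors of equal length and uses Euclidean distance from the centroid as the distance measure, which is a choice made for the example rather than one stated above. The names `centroid`, `radius` and `group_documents` are hypothetical.

```python
import math
from itertools import combinations


def centroid(group):
    """Component-wise mean of the vectors in a group."""
    dims = len(group[0])
    return tuple(sum(v[i] for v in group) / len(group) for i in range(dims))


def radius(group):
    """Maximum distance of any member from the group centroid."""
    center = centroid(group)
    return max(math.dist(v, center) for v in group)


def group_documents(vectors, radius_limit):
    """Start with singleton groups and repeatedly perform the smallest-radius merger."""
    groups = [[v] for v in vectors]
    while len(groups) > 1:
        # Examine every potential merger and keep the one with the smallest radius.
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda pair: radius(groups[pair[0]] + groups[pair[1]]))
        merged = groups[i] + groups[j]
        if radius(merged) > radius_limit:
            break  # no further merger is possible within the radius limit
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups


if __name__ == "__main__":
    docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(group_documents(docs, radius_limit=1.0))
```

This exhaustive search over all pairs is adequate for a sketch; since the grouping can take place in the background, the cost of examining every potential merger may be acceptable for modest collections.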

[0045] There are several attributes of documents or files that can be selected by the user to create good groups of 
candidates for removal. Groups that take up more storage space are more beneficial to remove. A group with low 
relevance that has been removed is less likely to incur a cache miss than a group with high relevance. A "narrow" 
group (one containing very similar documents) is a good candidate for removal because its narrowness minimizes the 
risk that users will inadvertently remove important documents because the short summary presented for a group does 
not resemble some outlying document that happens to be relevant. To assist the user, one embodiment of this invention 
uses a linear combination of the metrics of: 1) the time since the last access; 2) the relevance to the user's interests; 
3) the storage space for the entire group; 4) the number of documents; and 5) the radius of the group; to determine 
which groups to present to the user. 

[0046] The standard cache replacement policy is the Least Recently Used (LRU) policy, where a document's
relevance is the inverse of the time since its last access. A document or group having a low relevance, because it has
been a long time since its last access, is the best candidate for removal. The system of this invention sets the LRU
time for a group of documents to be that of the most recently accessed document.
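
By way of illustration only, the LRU time of a group and the corresponding relevance might be computed in Python as follows. The function names are hypothetical, and access times are assumed to be numeric timestamps in consistent units (for example, seconds).

```python
def group_last_access(access_times):
    """LRU time of a group: the time of its most recently accessed document."""
    return max(access_times)


def lru_relevance(last_access, now):
    """Relevance as the inverse of the time elapsed since the last access."""
    elapsed = now - last_access
    # An item accessed at this very instant is treated as maximally relevant.
    return 1.0 / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    latest = group_last_access([100.0, 250.0, 175.0])
    print(latest, lru_relevance(latest, now=260.0))
```

Taking the most recent access as the group time means that one recently used document is enough to keep its whole group from being ranked as stale.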

[0047] The system may optionally have an evolving query that approximates the user's interests, or some other
mechanism for relevance feedback. A relevance feedback process can be used to improve query formulation while
altering the original queries in two substantial ways:

1. Terms present in previously retrieved documents that have been identified as relevant to the query are added
to the original query formulations; and

2. The weights of the original query terms are altered by replacing the inverse document-frequency portion of the
weights with term-relevance weights obtained by using the occurrence characteristics of the terms in the previously
retrieved relevant and non-relevant documents of the collection.

[0048] Assuming that the initial query is available in the form specified in a manner similar to Eq. (3), a reformulated
query will take the following form:

$$Q = \alpha\,(W_{q1}, W_{q2}, \ldots, W_{qt}) + \beta\,(W_{q,t+1}, W_{q,t+2}, \ldots, W_{q,t+m}); \qquad (6)$$

where

$\alpha$ and $\beta$ may take values between 0 and 1; and

the weights of the t initial terms may be generated by taking combined term-frequency and inverse-document
frequency weights.

[0049] The weights of the newly added terms $T_{t+1}$ to $T_{t+m}$, on the other hand, may be determined as a combined
term-frequency and a term-relevance weight.

[0050] With a relevance feedback mechanism the relevance of the document to the user's interest can be estimated. 
The system can then base the relevance for a group of documents on the document having the maximum relevance 
within the group. 
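
By way of illustration only, the query reformulation and the group relevance described above might be sketched in Python as follows. The names `reformulate_query`, `dot` and `group_relevance` are hypothetical, queries and documents are assumed to be sparse {term: weight} dictionaries, and a plain inner product stands in for whichever similarity measure an embodiment uses.

```python
def reformulate_query(original, added, alpha=0.5, beta=0.5):
    """Scale the original term weights by alpha and the newly added term weights by beta."""
    query = {term: alpha * weight for term, weight in original.items()}
    for term, weight in added.items():
        query[term] = query.get(term, 0.0) + beta * weight
    return query


def dot(query, doc):
    """Inner product of two sparse vectors (assumed similarity measure)."""
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


def group_relevance(query, group, similarity=dot):
    """Relevance of a group: that of its most relevant document."""
    return max(similarity(query, doc) for doc in group)


if __name__ == "__main__":
    q = reformulate_query({"cache": 1.0}, {"mobile": 2.0}, alpha=0.5, beta=0.25)
    group = [{"cache": 2.0}, {"mobile": 4.0, "cache": 2.0}]
    print(q, group_relevance(q, group))
```

Using the maximum rather than the mean keeps a group with a single highly relevant document from being ranked as irrelevant.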

[0051] The system may optionally have a user controllable device, such as a GUI interface for setting the minimum
size of the groups. This allows the user to decide how much benefit they require for each interaction with the storage 
management interface. 

[0052] The system may also permit the user to determine the minimum or maximum width of groups that are to be 
presented. To compute the width of a group, the system uses the radius, which is defined as the maximum distance 
of any element from the centroid of the group. This metric is preferred to an intra-cluster similarity (e.g., mean square
distance from the centroid) because it is more sensitive to outliers, and the outliers are the documents in a group that 
are the least similar in content to the other documents. 

[0053] The groups of documents may then be ranked in accordance with any one and/or combination of the previously
described metrics or in accordance with a group score calculated from these metrics. A score for each group can be
expressed as:

$$\text{Group Score} = \alpha \cdot \mathrm{LRU} + \beta \cdot (\text{relevance}) + \gamma \cdot (\text{total size}) + \delta \cdot (\text{cluster radius}) + \epsilon \cdot (\text{\# of documents}), \qquad (7)$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ and $\epsilon$ are user defined constants.

[0054] When the user is removing documents from the cache, the groups with the lowest scores are presented to the user before groups having
higher scores. Alternatively, groups with the highest scores are presented first when the system is being used to select 
documents to download from an external data source. 
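
By way of illustration only, the group score and the two presentation orders might be sketched in Python as follows. The `Group` class, its field names and the functions `group_score` and `rank_groups` are hypothetical. The sign and magnitude of each constant are left to the user, as in the description, so the values in the example are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Group:
    keywords: str
    lru: float          # LRU metric of the group
    relevance: float    # relevance to the user's interests
    total_size: float   # storage space of the whole group
    radius: float       # maximum distance of any document from the centroid
    n_docs: int         # number of documents in the group


def group_score(g, alpha, beta, gamma, delta, epsilon):
    """Linear combination of the five group metrics."""
    return (alpha * g.lru + beta * g.relevance + gamma * g.total_size
            + delta * g.radius + epsilon * g.n_docs)


def rank_groups(groups, weights, for_removal=True):
    """Lowest scores first when removing from the cache, highest first when downloading."""
    return sorted(groups, key=lambda g: group_score(g, *weights),
                  reverse=not for_removal)


if __name__ == "__main__":
    a = Group("95 Windows Microsoft NT", 0.5, 0.25, 100.0, 0.25, 4)
    b = Group("mobile wireless network", 2.0, 1.0, 100.0, 0.25, 4)
    for g in rank_groups([b, a], (1.0, 1.0, 0.0, 0.0, 0.0)):
        print(g.keywords)
```

Setting a constant to zero, as in the example, removes the corresponding metric from the ranking altogether.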
[0055] The system of this invention presents the groups in a scrollable, explodable table 20, as shown in Fig. 2. For
each group, the user sees the rank 60, distinguishing keywords 62, the number of documents contained within that 
group 64, the total size of the group (in kilobytes) 66 and a most recent time and date of access of the group 68. For 
groups with only one document, the list of keywords may be replaced with the document title. Initially, the table 20 is 






EP 0 950 965 A2 

sorted by the rank 60, but the user may sort the table 20 by the size 66, the number of documents 64 or the time of
last access 68. The user changes the sort variable by selecting the heading of the respective column. When a user
explodes a group to a first level, the system displays the titles and attributes of the most representative documents 28.
When a user explodes the group further, all the documents in the group are displayed sorted by time of access 68.
Users may select entire groups or several files within a group for removal.
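The table behaviour described above (column sorting and two levels of expansion) can be sketched as follows. This is an illustration only: the class names, the `representative` flag, and the ascending sort direction are assumptions, since the text does not say how representative documents are marked or in which direction the columns sort.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Document:
    title: str
    size_kb: int
    last_access: datetime
    representative: bool = False  # assumed to be set by the grouping step

@dataclass
class Group:
    rank: int
    keywords: list
    documents: list

    @property
    def num_documents(self):
        return len(self.documents)

    @property
    def size_kb(self):
        return sum(d.size_kb for d in self.documents)

    @property
    def last_access(self):
        # Most recent access of any document in the group.
        return max(d.last_access for d in self.documents)

# Column heading -> sort key, mirroring the sortable columns of the table.
SORT_KEYS = {
    "rank": lambda g: g.rank,
    "size": lambda g: g.size_kb,
    "documents": lambda g: g.num_documents,
    "last_access": lambda g: g.last_access,
}

def sort_table(groups, column="rank"):
    """Return the groups sorted by the selected column (rank by default)."""
    return sorted(groups, key=SORT_KEYS[column])

def explode(group, level):
    """Level 1: most representative documents. Level 2+: all, by access time."""
    if level == 1:
        return [d for d in group.documents if d.representative]
    return sorted(group.documents, key=lambda d: d.last_access)

# Hypothetical data.
a = Document("Trip notes", 12, datetime(1998, 3, 1), representative=True)
b = Document("Hotel list", 40, datetime(1998, 1, 15))
c = Document("Budget", 5, datetime(1998, 2, 10), representative=True)
groups = [Group(1, ["travel"], [a, b]), Group(2, ["finance"], [c])]
print([g.keywords for g in sort_table(groups, "size")])
print([d.title for d in explode(groups[0], 2)])
```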

[0056] Fig. 6 shows a block diagram of one embodiment of the processor 38 or processing system of this invention. 
As shown in Fig. 6, the processor 38 is implemented using a general purpose computer 90 comprising a controller 86, 
a memory 88, a group generator 72, a keyword generator 74, a group score calculator 76, a group expander 78, a 
document/group selector 80, a storage manager 82 and a group filter 84. These elements of the general purpose
computer 90 are interconnected by a bus 70. The group generator 72, the keyword generator 74, the group score
calculator 76, the group expander 78, the document/group selector 80, the storage manager 82 and the group filter
84, controlled by the controller 86, are used to implement the flow charts of Figs. 4 and 5. It should be appreciated that
many other implementations of these elements will be apparent to those skilled in the art. 

[0057] The group generator 72 implements the control routine outlined in Fig. 5. The keyword generator 74 generates
keywords for each group. The keywords may then be used to indicate the content of the file to a user viewing a group
listing. The keyword generator 74 may be a type of document summarizer. The group score calculator 76 generates a
group score. An example method for calculating a group score has been detailed above. The group score may be used
to rank the groups in a group listing. The group expander 78 expands a selected group into a list of documents contained
within the selected group. As detailed earlier, a user may then select individual documents on which to perform various
storage management functions. The document selection is performed using the document/group selector 80. The storage
manager 82 performs various storage management functions on the selected document(s) or group(s). The storage
manager can perform many processes that are common to conventional storage managers, such as deleting,
moving, copying, or adding a pointer to a document, flagging a document for a later process or compressing the document
in the background.
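A few of the storage management functions listed above can be sketched against an in-memory store. This is a simplified illustration, not the described implementation: the store is assumed to be a plain dictionary mapping document names to bytes, compression uses `zlib` and runs synchronously rather than in the background, and the class and method names are hypothetical.

```python
import zlib

class StorageManager:
    """Applies storage management functions to selected documents."""

    def __init__(self, store):
        self.store = store        # assumed: dict of name -> bytes
        self.compressed = set()   # names whose contents are zlib-compressed
        self.flagged = set()      # names flagged for a later process

    def delete(self, names):
        for name in names:
            self.store.pop(name, None)
            self.compressed.discard(name)
            self.flagged.discard(name)

    def copy(self, names, target):
        """Copy the selected documents into another store (a dict)."""
        for name in names:
            target[name] = self.store[name]

    def compress(self, names):
        for name in names:
            if name not in self.compressed:
                self.store[name] = zlib.compress(self.store[name])
                self.compressed.add(name)

    def flag(self, names):
        self.flagged.update(names)

# Hypothetical selection: one expanded group with two documents.
store = {"a.txt": b"x" * 1000, "b.txt": b"short", "c.txt": b"keep"}
selected_group = ["a.txt", "b.txt"]
manager = StorageManager(store)
manager.compress(["a.txt"])
manager.delete(["b.txt"])
print(sorted(store), len(store["a.txt"]) < 1000)
```

Passing a whole group's document list, as in `selected_group`, or a hand-picked subset of it corresponds to selecting an entire group or individual documents within it.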

[0058] While this invention has been described with the specific embodiments outlined above, many alternatives,
modifications and variations are apparent to those skilled in the art. Accordingly, the preferred embodiments described
above are illustrative and not limiting. Various changes may be made without departing from the spirit and scope of the
invention as defined in the following claims.
Claims



1. A document storage management system communicating with a document storage device, a plurality of documents
stored on the document storage device, the system comprising: 

a group generator that generates at least one group of documents from the plurality of documents based on 
the content of the plurality of documents; 

a selector that is capable of selecting one of the at least one group of documents; and 
a storage manager that manages storage of the selected group of documents. 

2. The system of claim 1, wherein the group generator generates the at least one group of documents based on the
similarity of the content of the plurality of documents. 

3. The system of claim 1 or claim 2, wherein the group generator generates at least one group of documents based 
further on at least one attribute of each of the plurality of documents.

4. A method for managing a document storage system, comprising: 

grouping a plurality of documents stored in a document storage device into a plurality of groups based on the 
content of the plurality of documents;

selecting at least one of the plurality of groups; and 

managing the storage of the selected at least one of the plurality of groups. 



5. The method of claim 4, wherein the grouping is based on the similarity of the content of the plurality of documents. 

6. The method of claim 4 or claim 5, wherein the plurality of documents are grouped further based upon at least one 
attribute of each of the plurality of documents. 
7. A graphical user interface for managing the storage of a plurality of documents that are stored in a document
storage device, the interface comprising: 

a display for displaying information to a user; 

at least one selectable group identifier visible on the display, the at least one selectable group identifier
representing corresponding groups of documents, wherein the groups of documents comprise the plurality of
documents that are grouped based on the content of the plurality of documents, wherein the at least one 
selectable group identifier is responsive to a selection of the at least one selectable group identifier to select 
a corresponding at least one group of documents; and 

a storage manager responsive to a command to perform a storage management function on the selected at 
least one group of documents. 

8. The interface of claim 7, wherein the plurality of documents are grouped based on the similarity of the content of
the plurality of documents. 

9. The interface of claim 7 or claim 8, wherein the plurality of documents are grouped further based on at least one
attribute of each of the plurality of documents. 

10. The interface of any of claims 7 to 9, wherein the at least one selectable group identifier is displayed in a list that
is ordered based on at least one group characteristic. 